The 20 longest words among the 1,000 most frequent words, ordered by rank
Local Rank | Rank in Wordlist | Word | Length |
---|---|---|---|
1 | 225 | އަޅުގަނޑުމެންގެ | 15 |
2 | 291 | ރައްޔިތުންނަށް | 14 |
3 | 299 | އަޅުގަނޑުމެންނަށް | 17 |
4 | 325 | މައްސަލައިގައި | 14 |
5 | 455 | މަސައްކަތްތައް | 14 |
6 | 483 | ބައްދަލުވުމުގައި | 16 |
7 | 525 | ކަންތައްތަކެއް | 14 |
8 | 563 | އަހަރެމެންނަށް | 14 |
9 | 565 | މަސައްކަތްތަކެއް | 16 |
10 | 599 | މަސައްކަތްކުރާ | 14 |
11 | 620 | ސަރަހައްދުގައި | 14 |
12 | 629 | ޗެމްޕިއަންކަން | 14 |
13 | 642 | ބައިނަލްއަގުވާމީ | 16 |
14 | 654 | އިންޓަނޭޝަނަލް | 14 |
15 | 697 | ފާހަގަކުރެއްވި | 14 |
16 | 747 | ހުށަހަޅާފައިވާ | 14 |
17 | 774 | މަސައްކަތުގައި | 14 |
18 | 809 | ގައުމުތަކުގައި | 14 |
19 | 953 | ރަސްމިއްޔާތުގައި | 16 |
20 | 959 | ކުޅުންތެރިންނަށް | 16 |
The 1,000 most frequent words contain many stopwords as well as the most frequent content words. Because stopwords are usually short, selecting the longest words filters most of them out, so the list presented here shows some important content words.
This list can be ordered not only by word length (as in this subsection), but also by rank or alphabetically. This will be done in the next subsections.
The content words in the table give a first impression of the main subject areas in a corpus.
If pre-processing was very poor, some non-words may appear in the list. More precise tests for poor pre-processing follow below.
The list above can be generated from a relatively small corpus. If a small corpus is representative of a larger one, the corresponding results should not differ very much. Hence, the list could be generated quickly from a small, randomly chosen subcorpus.
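The subcorpus idea can be checked with a short experiment. The following is only a minimal sketch under stated assumptions: the corpus is available as a list of sentence strings, words are separated by whitespace, and the function names (`ranked_wordlist`, `longest_in_top`, `subcorpus_overlap`) are hypothetical helpers, not part of the corpus database. It draws a random sample of sentences and reports which fraction of the longest frequent words of the full corpus also appears in the list computed from the sample.

```python
import random
from collections import Counter


def ranked_wordlist(sentences):
    """Return all words, most frequent first (ties broken by code point order)."""
    # Assumption: whitespace tokenization; real pre-processing may differ.
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def longest_in_top(sentences, top_n=1000, limit=20):
    """Return the set of the `limit` longest words among the `top_n` most frequent."""
    top = ranked_wordlist(sentences)[:top_n]
    # len() counts Unicode code points, so Thaana vowel signs count as characters.
    return set(sorted(top, key=lambda w: (-len(w), w))[:limit])


def subcorpus_overlap(sentences, fraction=0.1, top_n=1000, limit=20, seed=0):
    """Share of the full-corpus list that is recovered from a random subcorpus."""
    if not sentences:
        return 1.0
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    k = max(1, int(len(sentences) * fraction))
    sample = rng.sample(sentences, k)
    full = longest_in_top(sentences, top_n, limit)
    sub = longest_in_top(sample, top_n, limit)
    return len(full & sub) / len(full) if full else 1.0
```

An overlap close to 1.0 for a small `fraction` would support the claim that the subcorpus is representative for this statistic. The value depends on the corpus and on the random seed.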
SELECT @num:=@num+1 AS local_rank, w_id-100 AS rank_in_wordlist, word, CHAR_LENGTH(word) AS len FROM (SELECT @num:=0) xx, words WHERE w_id>100 AND w_id<=1100 ORDER BY len DESC LIMIT 50;
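For readers without access to the database, the selection made by the query above can be mirrored in a few lines of Python. This is a sketch, not the code used to produce the table. It assumes the wordlist is given as a Python list with the most frequent word first, and the names `longest_top_words` and `reorder` are hypothetical. Ties in length are broken by rank here, which the query does not guarantee, and the alphabetical order is plain code point order rather than a language-specific collation.

```python
def longest_top_words(ranked_words, top_n=1000, limit=20):
    """Select the `limit` longest words among the `top_n` most frequent words.

    ranked_words: list of words, most frequent first (rank 1 is index 0).
    Returns a list of (rank, word) pairs, longest word first.
    """
    top = list(enumerate(ranked_words[:top_n], start=1))
    # len() counts Unicode code points, like CHAR_LENGTH() on a UTF-8 column.
    return sorted(top, key=lambda rw: (-len(rw[1]), rw[0]))[:limit]


def reorder(selection, order="length"):
    """Re-sort a selection by 'length', 'rank', or 'alphabetical'."""
    if order == "length":
        return sorted(selection, key=lambda rw: (-len(rw[1]), rw[0]))
    if order == "rank":
        return sorted(selection, key=lambda rw: rw[0])
    if order == "alphabetical":
        return sorted(selection, key=lambda rw: rw[1])  # code point order
    raise ValueError(f"unknown order: {order!r}")


# Toy wordlist standing in for the real one.
ranked = ["a", "bbb", "cc", "dddd", "ee"]
selection = longest_top_words(ranked, top_n=4, limit=2)
by_rank = reorder(selection, "rank")
```

Selecting first and sorting afterwards keeps the three views (by length, by rank, alphabetically) consistent, since all of them show the same set of words.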
Longest Words in Top-1000 by rank
Longest Words in Top-1000 alphabetically